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Abstract —The decreasing costs and increasing speed and 
accuracy of DNA sample collection, preparation, and sequencing 
has rapidly produced an enormous volume of genetic data. 
However, fast and accurate analysis of the samples remains 
a hottleneck. Here we present D^'RAGenS, a genetic sequence 
identification algorithm that exhibits the Big Data handling and 
computational power of the Dynamic Distributed Dimensional 
Data Model (D4M). The method leverages linear algebra and 
statistical properties to increase computational performance while 
retaining accuracy by subsampling the data. Two run modes, Fast 
and Wise, yield speed and precision tradeoffs, with applications 
in blodefense and medical diagnostics. The D'^RAGenS analysis 
algorithm is tested over several datasets, including three utilized 
for the Defense Threat Reduction Agency (DTRA) metagenomic 
algorithm contest. 

□ 

1. Introduction 

With advances in biological and chemical knowledge as 
well as increased world travel, bioterrorism and deadly disease 
outbreaks are real concerns. The mitigation of these threats lies 
in accurate source and organism identification, making rapid 
genetic sequencing and organism identification of particular 
interest in homeland defense. 

Bioterror attacks made headlines in 2001 with the mailing 
of anthrax containing letters to notable media and political 
figures that resulted in 17 cases and 5 deaths. The timing of 
the first letter, shortly after the September 11*^ terror attacks, 
increased public anxiety of the possibility of a foreign attack 
Q. A nine-year F.B.I. investigation ensued that dismissed the 
idea of a foreign assailant when analysis linked the anthrax 
strain and concentration to those previously developed by the 
United States biowarfare defense program Q. Finally, a U.S. 
government scientist was indicted for the crimes 

An example of widespread food home illness lies in the 
2011 virulent Germany E. coli outbreak. A slightly pathogenic 
organism, E. coli typically has minor gastrointestinal effects 
and little chance of lasting damage or death. The Germany 
outbreak however was much more serious and over the course 
of almost three months, the bacteria swept through Germany 
affecting over 4,000 individuals and resulting in 53 fatalities 
Q. Genetic sequencing did play a role in the discovery of a 
bacteriophage insertion carrying the Shiga toxin in the E. coli 
strain O104;H4 that accounted for the severity 0- 
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Despite the sequencing, the source was not correctly iden¬ 
tified as sprouts until 41 days after the first hospitalized case, 
and in the mean time the media and critics ran wild with 
the possibility that the bacteria was a genetically engineered 
bioattack Q. Additionally, the misidentification of the source 
as Spanish cucumbers caused European Union farmers weekly 
losses up to $611 million 0. 

In 2001 and 2011 genetic sequencing machines existed, but 
the lack of widely available fast analysis techniques inhibited 
the identification of the source of the E. coli and anthrax. Since 
the anthrax attacks the United States has developed federal 
laboratories to counter biological terrorism, but fast detection 
and characterization methods are still inadequate 0 - 

Sequencing machines continue to be upgraded. At a frac¬ 
tion of the previous cost, new Next Generation Sequencers 
(NGS) have the capability of sequencing an entire human 
genome in mere days, with smaller genomes in minutes to 
hours. These sequencers combined with fast analytic methods 
will allow for fast detection methods with endless application 
in homeland protection as well as infectious disease, inherited 
disease, cancer genomics, and individualized medicine. 

In general, metagenomics algorithms seek to tackle four 
analysis problems: (1) organism identification, (2) correct 
classification of sequences, (3) gene identification, and (4) 
variant identification. With the goal of finding an algorithm that 
would allow DNA sequencing to become a rapid diagnostic 
medical tool, the Defense Treat Reduction Agency (DTRA) 
sponsored a challenge to find a fast and accurate method 0. 
Several metagenomic algorithms are currently available, some 
as a result of the DTRA challenge. Each algorithm uses a 
wide range of techniques, with trade-offs between time and 
accuracy. 

Many algorithms utilize existing tools, such as the “gold- 
standard” BLAST (Basic Local Alignment Search Tool), and 
modify the results to achieve better identification levels 0 
m HI)- These algorithms, including MetaPhlyer 0 and 
MetaPhlAn | |T0) , can achieve faster speeds by running BLAST 
searches on reduced datasets of marker genes. Other algorithms 
test sequence similarity through the use of computationally 
expensive hidden-Markov-models (HMMs) and creation of 
phylogenetic trees 0 |[T3). 

To avoid cumbersome external programs, HMMs, and 
phylogenetic trees, E) instead segments the sequences into 
short k-mers, and performs sequence comparisons by matching 
the k-mers in a sorted list. The methods MetaCV 113, LMAT 
| |T3 , and Kraken lEZl employ a similar k-mer segmenting 
strategy in combination with phylogenetic trees to discover 
the least common ancestors. 
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Fig. 1. Degree distribution of the unique k-mers in the unique bacterial RNA 
sequences. Vertical lines indicate the percent of data distribution. The data is 
dominated by few common words, or “super-nodes.” Increasing the percentage 
of data used drastically increases the run time as shown by the red curve. 

In this paper we present D^RAGenS, a simplified k- 
mer comparison style algorithm that leverages the sparse 
linear algebra and computational power of the Dynamic Dis¬ 
tributed Dimensional Data Model (D4M). First defined in | |T8| , 
D'^RAGenS (D4M Rapid Analysis of Genetic Sequences) has 
been modified to include the data subsampling performance 
techniques described in | [T9) . The key idea is the segmentation 
of genetic sequences into short k-mers for easy and efficient 
string comparison with a basic matrix multiplication in D4M. 

Furthermore, with two run modes. Fast and Wise, 
D'^RAGenS undertakes the problems of organism identification 
and read classification resulting in a simple, but rigorous 
detection method that scales with increasing data volume. 
Results from several in silica test datasets are compared with 
current algorithms. 

II. Methods 

A. D4M 

Developed at MIT Lincoln Laboratory, D4M is an envi¬ 
ronment for Matlab that blends techniques from sparse linear 
algebra, graph theory, and abstract algebra to create triple¬ 
store format associative arrays. In the triple-store format, arrays 
can have strings and/or numerics as the row, column, and 
value keys, allowing for easy data querying. The structure 
of arrays allows for easy parallelization and increased com¬ 
putation capacity with minimal extra code. Additionally, all 
standard mathematical operations of multiple associative arrays 
are composable, and result in an associative array p0| . 

The architecture of D4M provides the tools necessary 
to create a rapid sequence alignment algorithm that takes 
advantage of simple mathematical properties. 

B. Sequence comparison 

During processing, NGS systems segment input DNA 
sequences into a relatively short length (typically 150-450 base 
pairs (bp) long). The short segments, called reads, are amplified 
to increase concentration, and then the DNA sequence is 
determined. Sequences are then characterized by comparisons 



OGenus Matches 

■ oSpecies Matches 

® 8 10 12 14 16 18 20 22 24 26 28 30 

Common Words Using the Least Common 2% of the Data 

Fig. 2. Normalizing the hit counts by the subsampled sequence length acts as 
a filter for signal and noise. Matches resulting in true organism identification 
(green points) have high normalization values, while false positives (red points) 
are clustered around low normalized values. 

to known DNA, RNA, and protein sequences. DNA and RNA 
sequence comparisons are the focus of this paper. 

Comparing just two sequences as strings is typically an 
order N^ computational problem. An alternative to string 
comparison is converting sequences into sparse vector repre¬ 
sentations and using the mathematical dot product operation as 
a measure of identity. Sequences are split into non-redundant 
overlapping 10 base pair (bp) k-mers, or words and used as 
vector indices. A vector dot produce yields the number of 
exact word matches between the two sequences. Furthermore, 
parallel comparisons between K unknown sequences of length 
N to M reference sequences (order (K x M) x N^) is computed 
with only a sparse matrix multiplication. 

Word length is chosen to optimize the algorithm sensitivity 
and array sparseness. The four DNA bases and 10 bp per word 
leads to a domain of 4^° (1,048,576) unique words. Long DNA 
sequences are divided into segments of 1,000 bp with a 100 
bp overlap prior to word formation. This maximum segment 
length and domain size leads to a 1 in 1,000 chance of random 
word matches between unrelated sequences. 

C. Subsampling of k-mers 

The D4M graph-linear algebra duality allows for the cre¬ 
ation of a bipartite graph between unique words and sequence 
identifiers. An edge exists between the vertex sets if that word 
is contained in the sequence, and the word degree identifies 
the frequency of use. In random sequences, all words occur 
with the same frequency. DNA however, is not random and 
common words represent highly conserved sequences that are 
present in many organisms (e.g. ribosomal RNA genes). The 
nonrandom and repetitive nature of DNA is highlighted in the 
degree distribution of the words (Figure [^, which shows the 
data is dominated by few words of high frequency. 

Frequent word use hampers the identification performance 
with extraneous data, numerous false positives, and long 
computation times. Removal of these supernodes results in 
the dramatic reduction of comparisons between the reference 
dataset and unknown sample, resulting in decreased run time. 
The red curve in Figure [T] shows the algorithm run time as 
a function of the percent of data. Run time increases quickly 
with data usage. Results of in silica generated datasets show 













the organism identification power is retained by measuring 
sequence identity (performing the sparse matrix multiplication) 
with the least common 2% of the data m- 

D. Sequence normalization 

Sequence similarity computed with the least common 
words discovers the truth organisms (based on the addition of 
FastqSim p2) in silico generated sequences to the datasets), 
but the results are plagued by false positive identifications 
caused by the reduced sample space. Normalizing the number 
of common words by the length of the subsampled sequence 
acts a second noise filter, as shown in Figure 

III. Algorithms 

A. Fast 

The Fast algorithm combines the data subsampling with 
sequence length normalization to rapidly identify the organ¬ 
isms present in the sample. After subsampling, the unknown 
sequences and reference database are compared with a sparse 
matrix multiplication. 

The words in common are first thresholded to eliminate 
chance k-mer matches, before selecting matches with nor¬ 
malization above 0.9. In the case of multiple alignments, the 
sample is classified to the least common taxonomic similarity 
between all strong reference matches. 

The two filters work together to remove extraneous false 
matches. Weaker true matches are also removed, but the 
strongest matches remain and allow organism detection even 
for low read numbers. Computational performance is increased 
by parallelizing the sequence comparisons and length normal¬ 
izations. 


B. Wise 

The Wise algorithm expands upon Fast mode to identify 
the organism and correctly classify reads. Fast mode is used 
as a first pass filter to identify target organisms. Then, through 
a second matrix multiplication, the sample is compared to the 
identified target organisms, this time using the full distribution 
of words to maximize the matching power. Common words 
are again thresholded to reduce noise. 

Preselecting target organisms condenses the reference set, 
allowing for a quick multiplication with the full word domain 
and minimizing noise caused by chance matches. The final 
common word threshold helps to reduce noise caused by 
family and genus level matches detected by Fast mode. Again, 
sequence comparison multiplications are performed in parallel 
to increase the comparison rate. 

For both Fast and Wise comparisons, the reference 
databases are built from data in the GenBank database. The 
RNA data is selected to provide a reduced dataset by selecting 
only the longest gene sequence for each organism by genus 
species. The selection eliminates redundancy and optimizes 
algorithm performance. 


IV. Results 

The Fast and Wise modes were tested on several in silico 
datasets and results compared with six competing algorithms. 
Organism identification and read mapping was tested using 
three in silico samples of human data spiked with bacterial 
data (samples A, B, and C) and one in silico viral sample. 
D^RAGenS was run using 60 cores on a Linux cluster with 
D4M. The other metagenomic algorithms were all run using 
60 cores on an HP SMP server. 

Sample A was processed on an Ion Torrent sequencer, 
and contains 379,027 total unknown sequences (reads) with 
an average length of 160 bp. Sample B was sequenced on 
the Roche platform and has a longer average read length of 
363 bp and a total of 399,670 reads. Sample C is Illumina 
data, with 6,038,556 reads of length 150 bp. The data size 
is predominately influenced by the number of reads, making 
sample C the largest. In addition to the human background, 
the three bacterial samples contain in silico spiked bacterial 
organisms. 

These samples were used in the DTRA challenge, and the 
spiked organisms all have pathogenic properties that can be 
exploited by bioterrorists. Klebsiella pneumonia in sample A 
has multi-drug resistance and is common within health care 
facilities | [^ . The bacterial organisms in sample B and C have 
been identified as possible biological weapons that can threaten 
the food supply |24| Brucellosis, a disease caused by 
species in the Brucella genus, including the Brucella abortus 
present in sample B, is easily transmissible between humans 
and livestock, and causes undulant fever in humans | [24l . 
Sample C contains Bacillus anthracis, the organism responsibe 
for the potentially fatal anthrax infection p5) . 

The in silico viral sample was created with FastqSim p2) 
to simulate Illumina Sequencing data. The test set contains 
22 known viruses at various concentrations, with average read 
length of 160 bp. Again, all viruses present commonly infect 
humans, some with fatal consequences, such as the Sudan 
ebolavirus. 

A. Bacterial Organism Identification 

In organism identification, the goal is to rapidly distinguish 
the bacterial organisms from the human background and 
maintain a low rate of false positives. The algorithms were 
scored using five criteria; number of true positives to species, 
false positives to species and genus, false negatives, as well as 
total run time. Results of the seven algorithms on the bacterial 
samples A, B, and C are shown in the radar plots in Figure 
The arrangement is such to showcase the stronger algorithms 
along the perimeter of the chart. 

Five of the seven algorithms correctly identified the spiked 
bacterial organisms in the three samples and received highest 
marks for the true positive and false negative to species. The 
other two, MetaCV v2.3.0 and MetaPhlyer 1.25, are designed 
to identify only to the genus level and thus receive zero false 
positive species by default. The usefulness of the top five 
algorithms is shown in the scores for the remaining categories; 
false positives to genus and species as well as run time. 

The Kraken vO. 10.4-beta algorithm consistently has the 
fastest run time and identified the spiked organism, but reports 
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Fig. 3. Organism identification results for samples A, B, and C. The left plots display the identification power of the algorithms. Each axis represents a testing 
parameter used in the comparison. The charts are designed with each axis radiating outwai'ds from poor to good results. For example, high numbers of true 
positives are desired, thus the true positive axes begin with 0 at the center and increase along the axis length. False positives are the opposite, and begin with 
high numbers at the center and decrease outward to 0. Thus, strongest candidates lie on the perimeter of the chart. The number of false positive species versus 
run times for the fastest five algorithms are highlighted in the figures on the right. 


an excessive amount of false positives. An additional 200-600 
organisms were identified, yielding a non-specific answer. 

The run times and false positives to species of the top five 
algorithms are highlighted in the plots on the right of Figure 
The four methods Metaphlan vl.7.7, Fast D^RAGenS, MetaS¬ 
cope 1^, and LMAT vl.2.1 have reasonable run times and false 
positives. The Fast D^RAGenS algorithm consistently places 
in the top three for both categories (excluding Kraken vO. 10.4- 
beta due to high false positives). 

B. Bacterial Read Mapping 

The correct classification of reads to the organism, or read 
mapping, is a difficult task made especially challenging by 


the similarities of species within the same genus. Often, such 
species can have greater than a 95% identity. In these instances, 
the majority of reads have equally strong matches to several 
species within the true genus, and are thus categorized by 
D^RAGens as the common genus. 

The species within the genus of the spike sample A or¬ 
ganism are distinct enough that the bulk of reads are correctly 
classified to the species level. Samples B and C however were 
created to test the precision of the algorithms and there are 
species within the genus with a high percentage identity. The 
Bacillus anthracis present in sample C is one of three highly 
genetic similar pathogens in the Bacillus genus p5j . 

The number of reads classified to each truth organism 









TABLE I. Number of reads classified as each truth organism 
BY THE SIX detection ALGORITHMS THAT REPORT READ MAPPING. THE 
LAST ROW GIVES THE RESULTS OF RUNNING WISE MODE ON A GDNA 
REFERENCE DATABASE FOR SAMPLES A AND B. 


Algorithm 

SampleA 

Sample B 

Sample C 


Species 

Genus 

Species 

Genus 

Species 

Genus 

Truth 

58,707 

58,707 

49,948 

49,948 

39,996 

39,996 

Wise 

D^RAGenS 

37,486 

37,486 

961 

24,558 

4,107 

21,261 

MetaScope 

53,999 

54,109 

48,675 

49,517 

39,613 

39,867 

Kraken 
vO. 10.4-beta 

49,158 

51,995 

9,287 

49,948 

9,444 

40,184 

LMAT 

vl.2.1 

9,767 

36,339 

106 

36,253 

2,369 

12,492 

MetaPhyler 

1.25 

0 

45 

0 

71 

0 

188 

MetaCV 

0 

75,749 

0 

49,936 

0 

5,847 

(gdna) Wise 
D^RAGenS 

58,707 

58,707 

35,018 

49,303 



are shown in Table As predicted, algorithms perform sig¬ 
nificantly better on sample A than B or C, and are able 
to assign greater portions of the data to the species level. 
Kraken vO. 10.4-beta again performs well, but the usefulness 
is still undermined by the extreme numbers of false positives. 
Again, MetaCV finds no species level matches by design, 
and MetaPhyler 1.25 maps under 0.5% of the reads to the 
genus level. MetaScope performs exceptionally well with a 
read mapping rate greater than 90%. Wise D'^RAGenS and 
LMAT vl.2.1 have similar read mapping performance. 

As mentioned, the D^RAGenS methods use an optimized, 
smaller reference database built from RNA sequences. Since 
RNA is only the coding portion of the genome (i.e. the genes), 
and the bacterial samples contain the coding and non-coding 
portions, all of the unknown sequences will not be detected 
using the RNA database. Considering that 10-15% of bacterial 
genomes are non-coding regions [26) and that many reads 
span coding and non-coding regions, the Wise D^RAGenS read 
mappings are very reasonable. 

To exhibit the read identification power of the Wise mode, 
the algorithm was run using the full genomic sequences (gdna) 
of the truth organisms as the reference database. The tallies of 
read mapping using the gdna are shown in the last row of 
Table Over 93% of sample A reads are correctly classified 
to the species level, and 98% of sample B reads to the genus 
level, bringing the results level with MetaScope. The low 
false positives, reasonable run times and mapping abilities 
place MetaScope, D"^RAGenS, and LMAT above the other 
algorithms. 

C. Virus Identification and Read Mapping 

All seven algorithms were again run on the generated 
virus dataset, and results analyzed for identification and read 
mapping. The virus RNA dataset is about 15 times smaller than 
the bacterial reference. In order to account for the decreased 
size, the percentage of data used in Fast mode is increased from 
2 to 8%. Despite the increase in data usage, the computation 
times remain fast due to the smaller number of viral organisms. 

Organism detection results displayed in Figure show that 
Fast D'^RAGenS, MetaScope, and Kraken vO. 10.4-beta suc- 
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Fig. 4. Results of organism identification on the 22 virus in silico Illumina 
test set. The Fast D'^RAGenS algorithm scores perfect marks. 

cessfully identify all 22 organisms without any false positives. 
LMAT vL2.1 reported 2 false negatives and 3 false positives. 
MetaPhyler 1.25 and MetaPhlan vL7.7 failed to run on the 
virus test set, and MetaCV v2.3.0 only identified bacterial 
matches. 

Read mapping results in Table [n|reveal the four algorithms 
that produced results have similar scores on this test set. 
Compared to bacteria, viral DNA is gene dense, with fewer 
non-coding regions, allowing for high read identification using 
only RNA. In this case, Kraken vO. 10.4-beta is believed to have 
no false positives due to the absence of a human background. 

V. Conclusion 

Along with NGS tools, D'^RAGenS can aid in homeland 
security by detecting bioterror threats and disease outbreaks. In 
2001 and 2011, faster analysis would have resulted in quicker 
implication of the source of the E. coli and anthrax, fewer 
fatalities, reduced public anxiety, and less economic damage. 
The analysis can also be performed on an individual basis at 
health facilities to determine the cause of an illness before 
administering medications. 

The techniques applied in Wise D'^RAGenS naturally lead 
to strain-level identification of pathogens. The pipeline can be 
modified to use a larger, strain specific dataset after initial 
identification of target organisms with Fast mode. Such an 
enhanced dataset would allow for detection of the minute 
differences between strains, leading to increased benefits for 
medical and threat response. Similar extensions can lead to 
gene detection. 

Additionally, the simple string comparisons using matrix 
multiplications and parallel capabilities can be exploited to 
create an ultra-fast multiple sequence aligner. Sequence align¬ 
ers show relatedness of multiple sequences and are useful in 
locating mutations between healthy and diseased cells (such 
as those associated with genetic disease and cancer). 

The basic mathematical filters used in D^RAGenS simplify 
the current algorithms while maintaining a high level of 
accuracy. Results show the algorithm to report comparable 
results to MetaScope and the publicly available LMAT vL2.1 
analysis tools. The high identification rate along with low run 
time and number of false positives places D^RAGenS as a top 
competitor for the best genetic analysis algorithm. 















TABLE II. 


Species-level read mapping results eor the 22 virus in silica dataset. 


Organism 

Truth 

Wise D^RAGenS 

MetaScope 

Kraken v0.10.4-beta 

LMAT vl.2.1 

South American hemorrhagic fever viruses 

Chapare 

961 

614 

947 

958 

952 

Guanarito 

481 

454 

473 

481 

342 

Junin 

291 

287 

291 

291 

204 

Machupo 

196 

190 

195 

195 

183 

Sabia 

97 

70 

97 

97 

97 

Tick-borne enchephalitis complex (flavi) viruses 

Omsk hemorrhagic fever virus 

495 

327 

495 

494 

486 

Alkhumra hemorrhagic fever virus 

294 

296 

294 

294 

0 

Langat Virus 

100 

111 

100 

100 

100 

Influenza 

Influenza A virus 

655 

560 

605 

624 

0 

Influenza B virus 

411 

337 

381 

401 

186 

Influenza C virus 

120 

91 

115 

118 

75 

Filoviridae 

Sudan ebolavirus 

17,277 

16,809 

16,804 

17,274 

16,976 

Marburg marburgvirus 

175 

175 

175 

175 

135 

Papillomaviridae 

Human papillomavirus type 32 

1,457 

1,270 

1,457 

1,457 

1,405 

Human papillomavirus type 5 

354 

329 

350 

354 

377 

Canine papillomavirus 3 

71 

67 

71 

71 

71 

Delta papillomavirus 1 

7 

4 

7 

7 

7 

HIV and similar 

Human immunodeficiency virus 1 

839 

790 

737 

839 

2 

Human immunodeficiency virus 2 

475 

319 

473 

475 

312 

Simian immunodeficiency virus 

88 

18 

88 

88 

85 

SARS and similar 

Severe acute respiratory syndrome-related coronavirus 

273 

271 

273 

273 

134 

Human coonavirus HKUl 

137 

137 

137 

136 

133 

False Negatives 


0 

0 

0 

2 

False Positives 


0 

0 

0 

3 
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